Report crashing OTEL process cleanly with proper status reporting #11448
base: main
Conversation
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
💚 Build Succeeded
cc @blakerouse
swiatekm left a comment
The logic looks good to me, but I have some concerns about deadlocks in the otel manager caused by trying to report status from the main loop.
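To illustrate the hazard, here is a contrived Go sketch (not the otel manager's actual code) of how a loop that is the sole reader of an unbuffered channel deadlocks itself by also sending on it:

// Contrived sketch of the deadlock risk, not the otel manager's real code:
// the run loop is the only reader of the unbuffered channel, so a send
// from inside that same loop blocks forever.
package main

func main() {
	updates := make(chan string) // unbuffered

	for {
		select {
		case u := <-updates:
			_ = u // handle a status update
		default:
			// Nothing can drain the channel while this goroutine blocks
			// on the send, so the loop deadlocks itself here.
			updates <- "status"
		}
	}
}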
exitCode := 0
err = cmd.RunCollector(ctx, nil, true, "debug", monitoringURL)
if err != nil && !errors.Is(err, context.Canceled) {
	fmt.Fprintln(os.Stderr, err)
FYI, if you use otelcol builder, its generated main.go uses log.Fatalf here. We should do the same if we can.
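A minimal sketch of that suggestion, with a hypothetical runCollector standing in for cmd.RunCollector from the diff above:

// Hypothetical sketch of the suggestion; runCollector stands in for
// cmd.RunCollector and the message text is assumed, not taken from the PR.
package main

import (
	"context"
	"errors"
	"log"
)

func runCollector(ctx context.Context) error { return nil } // placeholder

func main() {
	ctx := context.Background()
	if err := runCollector(ctx); err != nil && !errors.Is(err, context.Canceled) {
		// log.Fatalf prints the error and exits non-zero, matching the
		// pattern the otelcol builder uses in its generated main.go.
		log.Fatalf("collector run finished with error: %v", err)
	}
}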
// pipelines
for pipelineID, pipelineCfg := range c.Service.Pipelines {
	for _, recvID := range pipelineCfg.Receivers {
		instanceID := componentstatus.NewInstanceID(recvID, component.KindReceiver, pipelineID)
I think it's worth commenting that the upstream graph building code creates a single instance id for each receiver, containing information about all the pipelines it appears in. The status reporting then makes a copy of the status for each pipeline, so it's fine for us to do so here as well, but it's worth calling out imo.
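Something like the following comment on the loop would capture that (the wording is a suggestion only; the loop body is from the diff above):

// NOTE: upstream graph building creates a single InstanceID per receiver
// that records every pipeline it appears in, and status reporting then
// copies that status out per pipeline. Creating one InstanceID per
// (receiver, pipeline) pair here yields the same per-pipeline statuses.
for pipelineID, pipelineCfg := range c.Service.Pipelines {
	for _, recvID := range pipelineCfg.Receivers {
		instanceID := componentstatus.NewInstanceID(recvID, component.KindReceiver, pipelineID)
		// ...
	}
}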
forIDStr := fmt.Sprintf("for id: %q", instanceID.ComponentID().String())
failedMatchStr := fmt.Sprintf("failed to start %q %s:", instanceID.ComponentID().String(), strings.ToLower(instanceID.Kind().String()))
It'd be good to specify what conditions these lines refer to. The one about starting the component is relatively obvious, but forIDStr, not so much. Is this an error about the otel collector not knowing about the component type?
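For reference, this is what the two match strings expand to for an example component (purely illustrative; the ids here are invented):

// What the two match strings expand to for an example component.
package main

import (
	"fmt"
	"strings"
)

func main() {
	componentID := "elasticsearch/_agent-component/monitoring"
	kind := "Exporter"

	forIDStr := fmt.Sprintf("for id: %q", componentID)
	failedMatchStr := fmt.Sprintf("failed to start %q %s:", componentID, strings.ToLower(kind))

	fmt.Println(forIDStr)       // for id: "elasticsearch/_agent-component/monitoring"
	fmt.Println(failedMatchStr) // failed to start "elasticsearch/_agent-component/monitoring" exporter:
}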
// reportStartupErr maps this error to the *status.AggregateStatus.
// this is done by parsing the `m.mergedCollectorCfg` and converting it into the best effort *status.AggregateStatus.
func (m *OTelManager) reportStartupErr(ctx context.Context, err error) {
One thing I don't like about this function is that it pretends to report an error but actually reports a status. We have different delivery requirements for these. Reporting errors is non-blocking, as a newer error can always overwrite an older error. The same is not true of statuses: we need to make sure all the component statuses are delivered, or else we get bugs like the one in #10675.
As a result, reportStartupErr needs to be called with care, as it can potentially deadlock the manager. I wonder if we should bite the bullet and just make the update channel buffered with some reasonable size, and emit a fatal error if it can't be written to. What do you think?
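A minimal sketch of that buffered-channel idea; the field names, buffer size, and AggregateStatus stub below are assumptions, not the PR's actual API:

// Minimal sketch of the buffered-update idea with invented names.
package main

import "log"

type AggregateStatus struct{} // stub for *status.AggregateStatus

type OTelManager struct {
	updateCh chan *AggregateStatus // buffered, e.g. make(chan *AggregateStatus, 64)
}

// trySendStatus never blocks the manager's main loop. Statuses, unlike
// errors, must all be delivered, so a full buffer is treated as fatal
// rather than silently dropping an update.
func (m *OTelManager) trySendStatus(st *AggregateStatus) {
	select {
	case m.updateCh <- st:
	default:
		log.Fatalf("otel manager: status update channel full, cannot guarantee delivery")
	}
}

func main() {
	m := &OTelManager{updateCh: make(chan *AggregateStatus, 64)}
	m.trySendStatus(&AggregateStatus{})
}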
I ran the collector with the following configuration which should not run:
outputs:
  default:
    type: elasticsearch
    hosts: [127.0.0.1:9200]
    api_key: "example-key"
    #username: "elastic"
    #password: "changeme"
    preset: balanced
    otel:
      exporter:
        not_a_setting: true

It does not run, as expected, and the collector exits with the following output. It looks a lot like a multi-line error, but we manage to grab the last line, which at least contains the configuration key name:
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"'exporters' error reading configuration for \"elasticsearch/_agent-component/monitoring\": decoding failed due to the following error(s):","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"'' decoding failed due to the following error(s):","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"'' has invalid keys: not_a_setting","ecs.version":"1.6.0"}❯ sudo elastic-development-agent status
┌─ fleet
│ └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
├─ status: (DEGRADED) 1 or more components/units in a failed state
├─ beat/metrics-monitoring
│ ├─ status: (FAILED) FAILED
│ ├─ beat/metrics-monitoring
│ │ └─ status: (FAILED) '' has invalid keys: not_a_setting
│ └─ beat/metrics-monitoring-metrics-monitoring-beats
│ └─ status: (FAILED) '' has invalid keys: not_a_setting
├─ filestream-monitoring
│ ├─ status: (FAILED) FAILED
│ ├─ filestream-monitoring
│ │ └─ status: (FAILED) '' has invalid keys: not_a_setting
│ └─ filestream-monitoring-filestream-monitoring-agent
│ └─ status: (FAILED) '' has invalid keys: not_a_setting
├─ http/metrics-monitoring
│ ├─ status: (FAILED) FAILED
│ ├─ http/metrics-monitoring
│ │ └─ status: (FAILED) '' has invalid keys: not_a_setting
│ └─ http/metrics-monitoring-metrics-monitoring-agent
│ └─ status: (FAILED) '' has invalid keys: not_a_setting
├─ prometheus/metrics-monitoring
│ ├─ status: (FAILED) FAILED
│ ├─ prometheus/metrics-monitoring
│ │ └─ status: (FAILED) '' has invalid keys: not_a_setting
│ └─ prometheus/metrics-monitoring-metrics-monitoring-collector
│ └─ status: (FAILED) '' has invalid keys: not_a_setting
└─ extensions
├─ status: StatusFatalError ['' has invalid keys: not_a_setting]
└─ extension:healthcheckv2/bcc9882f-6e23-4ec6-b6a7-0abecb4c2ded
└─ status: StatusFatalError ['' has invalid keys: not_a_setting]
That looks correct enough, except that we leak the healthcheck extension status verbatim (but not beatsauth or the diagnostics extension), which we probably shouldn't do (or at least it is inconsistent).
// The flow of this function comes from https://github.com/open-telemetry/opentelemetry-collector/blob/main/service/internal/graph/graph.go
// It's a much simpler version, but follows the same for loop ordering and building of connectors of the internal
// graph system that OTEL uses to build its component graph.
func otelConfigToStatus(cfg *confmap.Conf, err error) (*status.AggregateStatus, error) {
This is complex enough to have its own tests. The configurations in manager_test.go right now don't cover it all, e.g. there are no configurations with connectors.
To me it seems like tests for this should focus on mapping status for complex configurations and the tests in manager_test.go can focus purely on ensuring various failure scenarios result in capturing a status regardless of what the configuration is.
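A hypothetical shape for such a dedicated test, using a connector fixture (the configuration, package name, and assertions are illustrative, not from the PR):

package otel

import (
	"errors"
	"testing"

	"go.opentelemetry.io/collector/confmap"
)

// Hypothetical dedicated test for otelConfigToStatus with a connector
// appearing as an exporter in one pipeline and a receiver in another.
func TestOtelConfigToStatus_Connectors(t *testing.T) {
	cfg := confmap.NewFromStringMap(map[string]any{
		"receivers":  map[string]any{"otlp": nil},
		"exporters":  map[string]any{"debug": nil},
		"connectors": map[string]any{"forward": nil},
		"service": map[string]any{
			"pipelines": map[string]any{
				"traces/in":  map[string]any{"receivers": []any{"otlp"}, "exporters": []any{"forward"}},
				"traces/out": map[string]any{"receivers": []any{"forward"}, "exporters": []any{"debug"}},
			},
		},
	})

	st, err := otelConfigToStatus(cfg, errors.New("startup failed"))
	if err != nil {
		t.Fatalf("otelConfigToStatus: %v", err)
	}
	if st == nil {
		t.Fatal("expected a non-nil aggregate status covering both pipelines")
	}
}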
What does this PR do?
At the moment, when a spawned OTEL subprocess fails, it is just reported as exit code 1. This provides no information about what has failed and marks the entire Elastic Agent as failed.
This changes that behavior by looking at the actual output to determine the issue and reporting a proper component status for the entire configuration. It does this by parsing the configuration and building its own aggregated status for the component graph, matching what the healthcheckv2 extension would return if it could run successfully. It inspects the error message to determine whether it can correlate the error to a specific component in the graph; if it cannot, it falls back to reporting the error on all components.
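A compressed sketch of that correlate-or-fail-everything behavior, with invented local types standing in for the PR's *status.AggregateStatus handling:

// Compressed sketch of the behavior described above; the types and the
// correlate helper are invented stand-ins, not the PR's implementation.
package main

import (
	"errors"
	"fmt"
	"strings"
)

type aggStatus struct {
	components map[string]error // component id -> failure, nil if healthy
}

// correlate marks only the named component as failed when the error text
// mentions its id, and otherwise falls back to failing every component.
func correlate(agg *aggStatus, runErr error) {
	for id := range agg.components {
		if strings.Contains(runErr.Error(), id) {
			agg.components[id] = runErr
			return
		}
	}
	for id := range agg.components {
		agg.components[id] = runErr
	}
}

func main() {
	agg := &aggStatus{components: map[string]error{
		"otlp": nil, "elasticsearch/monitoring": nil,
	}}
	correlate(agg, errors.New(`failed to start "elasticsearch/monitoring" exporter`))
	fmt.Println(agg.components)
}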
Why is it important?
The Elastic Agent needs to provide clean status reporting even when the subprocess fails to run, and it must not mark the entire Elastic Agent as failed when that happens.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files
[x] I have added an entry in ./changelog/fragments using the changelog tool
[ ] I have added an integration test or an E2E test (covered by unit tests)

Disruptive User Impact
None
How to test this PR locally
Use either an invalid OTEL configuration or one that will fail to start. Observe that when running elastic-agent run with that OTEL configuration in the elastic-agent.yml file (aka Hybrid Mode), elastic-agent status --output=full provides correct state information.

Related issues